Unlocking Insights into London's Hotel Market

Based on data from Booking.com

Candidate Numbers: 54454, 49823, 53900


1. Introduction

The tourism industry plays a significant role in the global economy, and the hotel and accommodation sector is a critical part of it. According to the House of Commons Library (2022), just before Covid-19 the hospitality industry contributed £59.3 billion, or 3%, to the UK's overall economic output. In each country and region, around 5% of enterprises were in the hospitality industry. London, as one of the world's most visited cities, has a thriving hotel and accommodation industry. The industry's growth has been fueled by London's status as a leading global financial centre and a top travel destination, making it an exciting and dynamic market to study.

The hotel and accommodation industry is highly competitive, with a wide range of service providers offering various types of accommodations at different price points. As such, industry players must understand consumer preferences and market trends to remain competitive and profitable. This study aims to explore and analyze the London hotel and accommodation market to gain insights into consumer behaviour, market dynamics, and trends, using data from Booking.com, a leading online travel agency. Some initial questions this report aims to answer include:

Several studies have already examined the London hotel and accommodation market. This study differs in that it analyzes a dataset from a leading online travel agency to gain insights into consumer behaviour, pricing trends, and market dynamics. Its originality lies in combining a diverse set of variables, including hotel information, consumer reviews and public open-source data, to build a comprehensive picture of the market.


2. Data Acquisition and Description

This project draws on several datasets, all of which are stored in the data folder of the current working directory. Some were obtained through an API and others from official public sources. This section gives a brief introduction to each of them.

2.1 Hotel's Data (Main Dataset)

Collected on 28 Jan 2023

The main dataset used in this project consists of hotel data obtained from the Booking.com API available on RapidAPI. The detailed process for obtaining the data is documented in ST445Project_DataAcquisition.ipynb in the current working directory.

The data obtained is stored in the data folder as data.json.

It includes hotel listings in London. Each hotel listing provides information such as the hotel name, star rating, property type, room types, availability for a specified check-in and check-out date, review scores rated by guests, pricing information, etc. It also includes information on the hotel's geographical location, such as the district and zip code. More information about the features of our data can be found in the Data Preparation/Cleaning (Main Dataset) section.

The data is organized in a JSON format and consists of a list of hotel dictionaries, with each dictionary containing multiple key-value pairs that correspond to the hotel's various attributes.

2.2 Hotel Review Data

Collected on 18 Feb 2023

After finishing their stay, guests are invited to answer two questions, "what did you like" and "what didn't you like". The answers correspond to positive and negative reviews respectively.

The hotel review data includes 10000 rows of reviews (both positive and negative) from 400 hotels (25 rows of reviews per hotel). It was collected on 18 February 2023 through the same API, using the Reviews of the hotel endpoint.

The hotel_ids_for_reviews.json file contains the list of hotel_id values for these 400 hotels. It was used to retrieve the hotel review data through the API (hotel_id is a parameter in an API call). More detail, including the code used to select these 400 hotels, can be found at the beginning of the Hotel Review Analysis section of this notebook.

The code used to collect the hotel review data through the API (including how to update the API key) can be found in a separate notebook, ST445Project_ReviewDataAcquisition.ipynb.

The data is stored in the data folder as review_data.json. It provides valuable insights into the customer experience at each hotel.

2.3 Other Public Data

There are also other datasets involved for cleaning/analysis purposes. They are all obtained through public channels:

  1. NSPL21_NOV_2022_UK.csv: This dataset contains the National Statistics Postcode Lookup (NSPL) for the UK as of November 2022. It provides a comprehensive list of postcodes in the UK, along with their corresponding administrative geography codes, such as the local authority, county, and region. The data can be downloaded here.

  2. LA_UA names and codes UK as at 04_21.csv: This dataset contains the names and codes for all local authorities in the UK as of April 2021. It includes the local authority or unitary authority name, the local authority or unitary authority code, and the region that it belongs to. The data can be downloaded here.

  3. london_boroughs.json: This dataset contains the geographical boundaries for the 32 London boroughs in GeoJSON format. It is used for geographical analysis. The data can be downloaded here.

  4. London Area Profiles (folder): This dataset contains various socio-economic indicators for different boroughs in London. It is used to provide additional context and insights for the analysis of the hotel data. The data can be downloaded here.


3. Data Preparation/Cleaning (Main Dataset)

3.1 Dataset Overview

3.2 Removing redundant columns

First of all, we manually reviewed the 90 features and kept the 27 that we think might be useful for later analysis.

Some of the feature names are not self-explanatory, so a detailed explanation is given below:

Feature                     Comment
distance_to_cc              Distance to city centre (in km, rounded to 0.05)
preferred                   Preferred Partner Programme is an exclusive programme that gives greater visibility to the top 30% of partners
preferred_plus              Preferred Plus is the premium tier of the Preferred Partner Programme
hotel_has_vb_boost          Visibility Booster is a marketing tool which allows partners to increase their visibility
review_nr                   Number of reviews
class                       Star rating, ranging from unrated to five-star hotels
has_free_parking            Whether free parking is available
review_score_word           Review score word (e.g., Very good, Good)
review_score                Review score (out of 10)
is_mobile_deal              Whether a mobile deal is available
mobile_discount_percentage  Discount if booked by mobile
price_is_final              Whether the price shown is the final price
min_total_price             Minimum price for any room type for a given period of stay
ribbon_text                 Breakfast included
urgency_message             Limited rooms remaining (e.g. Only 1 left at this price on Booking.com)
cpc_non_trader_copy         Whether the property is run by a professional or a private host
unit_configuration_label    Room type and information about the room

3.3 Drop duplicates

Some columns in the dataframe are in unhashable data types. Before dropping duplicates, we need to transform them to make sure their values are comparable.
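The transformation described above could look roughly like the sketch below. It is an illustration rather than the notebook's original code: the sample frame, the badges column and the helper name drop_duplicate_rows are assumptions made for the example. Lists and dictionaries are serialised to strings only for the comparison, so the original values are preserved in the result.

```python
import json

import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows even when some cells hold lists or dicts."""
    comparable = df.copy()
    for col in comparable.columns:
        # Lists and dicts are unhashable, so turn them into canonical strings.
        comparable[col] = comparable[col].map(
            lambda v: json.dumps(v, sort_keys=True)
            if isinstance(v, (list, dict))
            else v
        )
    # Keep the first occurrence of each distinct row, with its original values.
    return df.loc[~comparable.duplicated()].reset_index(drop=True)


# Hypothetical sample: the first two rows are identical.
hotels = pd.DataFrame(
    {
        "hotel_id": [1, 1, 2],
        "badges": [["preferred"], ["preferred"], []],
    }
)
deduped = drop_duplicate_rows(hotels)
print(deduped)
```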

3.4 Clean individual columns and deal with NaN values

There is information about the room type and the number of beds in the 'unit_configuration_label' column, which we think might be useful for further analysis.

Here we can have a look at the values in the unit_configuration_label column. The number of beds appears in one of the forms matched by the regular expression '((1\xa0bed)|(\d{1,2}\xa0beds)|(1 double or 2 singles))'. The room type is the text before the first '\n'.

Extract information about room types:

Extract information about the number of beds:
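A minimal sketch of both extraction steps, using the regular expression quoted earlier in this subsection. The helper name parse_unit_label and the sample labels are hypothetical, and the bed information is returned as matched text rather than converted to a number, since the report does not say how "1 double or 2 singles" is counted.

```python
import re

# Pattern quoted in the text; \xa0 is the non-breaking space used in the labels.
BED_PATTERN = re.compile(r"(1\xa0bed)|(\d{1,2}\xa0beds)|(1 double or 2 singles)")


def parse_unit_label(label):
    """Return (room_type, bed_info) parsed from a unit_configuration_label value."""
    # The room type is the text before the first newline.
    room_type = label.split("\n", 1)[0].strip()
    match = BED_PATTERN.search(label)
    # Replace the non-breaking space so the result is easier to read and compare.
    bed_info = match.group(0).replace("\xa0", " ") if match else None
    return room_type, bed_info


# Hypothetical label values for demonstration.
print(parse_unit_label("Deluxe Double Room\n1\xa0bed"))
print(parse_unit_label("Family Studio\n3\xa0beds"))
print(parse_unit_label("Twin Room\n1 double or 2 singles"))
```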

3.5 Clean zip and district

Null values make up a large part of these columns, and some district notations overlap (e.g., Acton forms part of Ealing). To be more precise in our analysis, we convert the district notation to the more standardized local authority level using the NSPL (National Statistics Postcode Lookup) database.

This suggests that 22 postcodes cannot be found in the national database, either because they are not recorded or because they were misrecorded. We propose to use the geopy module to correct those anomalous data points (GeeksforGeeks, 2022).
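The postcode lookup described in this subsection can be sketched as a left join between the hotel postcodes and the NSPL file. This is a minimal illustration, not the notebook's original code: the column names zip, pcds and laua, the cleaning step and the sample rows are all assumptions made for the example.

```python
import pandas as pd

# Tiny stand-ins for the hotel table and the NSPL lookup (illustrative rows only).
hotels = pd.DataFrame(
    {"hotel_id": [1, 2, 3], "zip": ["sw1a 1aa", "SW1A 1AA ", "XX0 0XX"]}
)
nspl = pd.DataFrame({"pcds": ["SW1A 1AA"], "laua": ["E09000033"]})

# Normalise the postcodes before matching (assumed cleaning step).
hotels["zip_clean"] = hotels["zip"].str.upper().str.strip()

# A left join keeps every hotel; indicator=True records whether a match was found.
merged = hotels.merge(
    nspl, how="left", left_on="zip_clean", right_on="pcds", indicator=True
)

# Postcodes missing from the lookup are the candidates for manual or geopy correction.
unmatched = merged.loc[merged["_merge"] == "left_only", "zip_clean"].tolist()
print(unmatched)
```

Counting the rows flagged as left_only gives the number of postcodes that still need correcting.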


The final cleaned dataset, which is analyzed in the following sections, is named df3.


4. Data Analysis

In this Data Analysis section, we delve deeper into the wealth of information available on Booking.com to gain insights into the hotel industry in one of the world's most popular travel destinations, London. By analyzing data on hotel pricing, class ratings, location, and amenities, we aim to provide insights into the trends and patterns that shape the tourism industry in London.

Throughout this analysis, we will use a range of statistical techniques and visualization tools to uncover interesting relationships and trends in the data. Our goal is to provide a comprehensive picture of the hotel industry in London, from the types of properties available to the factors that influence pricing. Ultimately, we hope to offer valuable insights for anyone interested in understanding the dynamics of the London hotel market.

4.1 Hotel Accommodations by Star Ratings and Property Types

First of all, we provide an overview of the types of accommodation available in the London hotel market, including hotels, apartments, and other property types. We also analyze the distribution of star ratings among hotels in London, which can provide insights into the quality of services and amenities offered by different hotels. The key questions to be answered are:

Based on these plots, we can see that more than half of the hotels in London don't have a star rating. And for those hotels that have a star rating, most of them are 3, 4 and 5-star. Only 4.5% of the hotels are 2-star rated and only 2 hotels are 1-star rated.

The dominance of 3, 4, and 5-star hotels in the city's market suggests that there is strong demand for high-quality accommodations among tourists visiting London. This is in line with the city's reputation as a top travel destination and a hub for business and cultural activities.

On the other hand, the fact that more than half of the hotels in London don't have a star rating is surprising, as it may suggest that a large proportion of the city's hotel market is either unregulated or doesn't meet the minimum criteria for a star rating. An intuitive guess is that many of these properties are small, independent establishments that don't have the resources or capacity to meet the requirements for a star rating. To validate our initial guess, we further break down those unrated properties into the analysis.

We can see from the above table that more than half of the properties without a star rating are apartments. This might be because relatively small, independent apartments don't have the resources or desire to go through the process of obtaining a star rating. Alternatively, the criteria for rating apartments may be stricter than those for hotels, so most apartments remain unrated.

To conclude, a rating can be a helpful guide for London travellers looking for a certain level of quality and amenities in their accommodation. However, it's also important to bear in mind that the lack of a star rating doesn't necessarily indicate poor quality, and there may be some hidden gems among unclassified hotels. Next, let's take a look at the price distribution by different star ratings.

Note that in the boxplot above, the outliers distort the representation of the data distribution and make the summary statistics difficult to interpret. They also affect the scaling of the axis and make it hard to compare different parts of the distribution. Therefore, to mitigate the impact of outliers on the visualization, prices are divided into 4 levels at the 25%, 50% and 75% quantiles. (A detailed analysis of outliers is included in the next section.)
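One way to build such quantile-based levels is pandas.qcut, sketched below. The sample prices, the variable name price_level and the labels 1 to 4 are assumptions for illustration. The notebook may have used a different approach.

```python
import pandas as pd

# Hypothetical prices, including one extreme value.
prices = pd.Series([60, 80, 95, 110, 130, 150, 200, 2500], name="min_total_price")

# Cut at the 25%, 50% and 75% quantiles so each level holds about a quarter of the data.
price_level = pd.qcut(prices, q=[0, 0.25, 0.5, 0.75, 1.0], labels=[1, 2, 3, 4])

print(pd.concat([prices, price_level.rename("price_level")], axis=1))
```

Because the cut points are quantiles, the extreme value simply lands in the top level and no longer stretches the axis of the plot.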

We can see from the left graph above that:

We can see from the right-hand side violin plot above that:

The evidence presented in the previous sections offers valuable insights for travellers seeking to book a hotel in London. One notable observation is that hotels with higher star ratings tend to charge higher prices and receive higher review scores, plausibly reflecting their superior services and facilities, and vice versa. The fact that unrated hotels have evenly distributed prices and a wide range of review scores supports our explanation that the absence of a rating does not necessarily imply poor service or quality. Rather, it underlines the importance of carefully checking reviews and property details before booking.

Next, we break down the dataset by different kinds of accommodation types available on Booking.com to understand key differences among them.

There are some results observed from the tables, for example:

These findings suggest that there is a wide range of accommodation types available on Booking.com, each with its unique advantages and disadvantages.

We observed some patterns from the strip plots, for example:

4.2 Hotel Pricing and Detection of Outliers

In this section, we explore the distribution of hotel prices in London and identify potential outliers in the data. By understanding the overall pricing trends and identifying unusual data points, we can gain a better understanding of the range of prices and the factors that may influence them.

Note that some data points deviate significantly from the rest of the data. We should pay special attention to these outliers before further analysis. In histograms, for example, they stretch the x-axis so that most of the distribution is compressed into a few bins, which undermines the quality and reliability of our later analysis.

The Freedman–Diaconis rule can be used to select the appropriate number of bins to be used when plotting a histogram.
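As a brief illustration, the rule sets the bin width to 2 * IQR / n^(1/3), where IQR is the interquartile range and n is the number of observations. The number of bins is then the data range divided by that width. The sketch below uses synthetic values and a hypothetical helper name.

```python
import numpy as np


def freedman_diaconis_bins(values):
    """Number of histogram bins suggested by the Freedman-Diaconis rule."""
    values = np.asarray(values, dtype=float)
    q75, q25 = np.percentile(values, [75, 25])
    # Bin width = 2 * IQR / cube root of the sample size.
    bin_width = 2 * (q75 - q25) / np.cbrt(values.size)
    if bin_width == 0:
        # Degenerate case: no spread in the middle half of the data.
        return 1
    return int(np.ceil((values.max() - values.min()) / bin_width))


# Synthetic data for demonstration only.
sample = np.arange(1, 101)
n_bins = freedman_diaconis_bins(sample)
print(n_bins)
```

NumPy also exposes the same rule directly through bins="fd" in numpy.histogram and numpy.histogram_bin_edges.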

Based on the plots, we can observe that most hotel prices in the region are within the 75-200 range, and they do not follow a normal distribution, as evidenced by the right-skewed histogram. This suggests that a greater number of hotels in the region have prices that are clustered towards the lower end, with only a few hotels charging significantly higher prices.

In an economic sense, it could be an indication of the presence of luxury hotels that are targeting high-end customers seeking premium experiences, or it could signify limited competition in the region, with only a few hotels dominating the market and charging higher prices.

A large share of the outliers fall into the Apartment property category. Regarding star rating, the majority of them are unrated or five-star properties. Sample outliers include Mandarin Oriental Hyde Park, Shangri-La and Rosewood London, which are widely known as high-end hotels.

4.3 Correlation Analysis

By analyzing the correlation map in this section, we can gain insights into the relationships between different features in the dataset. We explore interesting correlations between different features and discuss possible explanations for these relationships. Questions that will be answered here include:

The correlation heatmap and matrix provide interesting insights that can be grouped into five main points:

4.4 Geographical Analysis

This section further provides an overview of the spatial distribution of hotel properties in London. We can gain insights into:

In this map, the colour represents the price level of a given hotel, with blue being the lowest level and red the highest.

At first glance, some areas of London, such as Mayfair and Knightsbridge, have a high concentration of high-end hotels (the majority of hotels there are labelled yellow or orange). These high prices may be attributed to their convenient location, exclusive atmosphere, and luxury amenities. For example, these areas are home to designer boutiques and Michelin-starred restaurants, and are located near popular tourist attractions like Buckingham Palace and Hyde Park.

On the other hand, areas such as Bayswater and Paddington generally have lower hotel prices (the majority of hotels there are labelled blue). This may primarily be because they are located slightly further from the city centre and major tourist attractions.

The three districts with the most hotels are Westminster, Camden, and Kensington and Chelsea.
The three districts with the highest average prices are Wandsworth, Bexley and Westminster.

Since Bexley has only one hotel, its average price is not representative. We therefore use histograms and boxplots to give a more detailed and illustrative analysis of hotel pricing by district.
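The district-level summary can be sketched with a pandas groupby that reports the hotel count next to the average price, which makes thinly populated districts easy to spot. The column name district, the sample values and the minimum count of 2 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the real data has one row per property.
hotels = pd.DataFrame(
    {
        "district": ["Westminster", "Westminster", "Camden", "Camden", "Bexley"],
        "min_total_price": [300.0, 200.0, 150.0, 170.0, 400.0],
    }
)

summary = (
    hotels.groupby("district")["min_total_price"]
    .agg(n_hotels="count", mean_price="mean", median_price="median")
    .sort_values("mean_price", ascending=False)
)

# Districts with very few hotels give unreliable averages, so flag them out.
reliable = summary[summary["n_hotels"] >= 2]
print(summary)
print(reliable)
```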

Next, to be more precise in our analysis, we import some socio-economic factors for each district to see how they influence hotel prices and distribution.

The fields shown in the district description when you hover over the map are explained below:

Feature             Comment
inner_statistical   True if the borough is part of Inner London
greenspace          % of area that is greenspace
population_density  Population per hectare
tspt_access         Average Public Transport Accessibility Score
house_price         Median price paid for all house types
crime_rate          All crime rate
population          Total population
avg_pay             Mean annual pay

Some interesting clues could be found in this correlation heatmap, for example:

4.5 Affordable and Convenient Location Properties

This section identifies hotels that offer the best value for money based on their location, review score, and price. By focusing on properties that are both affordable and conveniently located, we can help travellers find the best possible options for their stay in London.

review_score_word is a column containing some description words based on the guests' review scores:

review_score_word  review_score
Exceptional        9.5+
Superb             9+
Fabulous           8.5+
Very good          8+
Good               7+
Pleasant           6+
Passable           5+
Disappointing      4+
Poor               3+
Very poor          2+
Bad                1+

Since they are based on the average review scores given by guests, we only look at properties with at least 10 reviews (df3.review_nr>=10) to reduce bias.

We created interactive visualisations using plotly.py, an interactive graphing library for Python (Plotly, n.d.).

Outliers greatly affect the visualisation. Here is the scatter plot including the outliers:

If we hide the outliers, the scatter plot becomes clearer:

From the interactive scatter plot above, we could see that:

4.6 Hotel Review Analysis

In this section, we analyse the hotel review data, covering positive and negative reviews in turn. For each part, we generated a word cloud and performed topic modelling using Latent Dirichlet Allocation (LDA) (Kapadia, 2019).

The hotel review data includes 10000 rows of reviews from 400 hotels (25 rows of reviews per hotel). The 400 hotels were selected by ranking on the number of reviews (review_nr), which ensured that each hotel had at least 25 reviews for us to retrieve.

The code below shows how we generated the hotel_ids_for_reviews.json file. This file contains the list of hotel_id values for the 400 hotels mentioned above. It was used to retrieve hotel review data through the API (hotel_id is a parameter in an API call).
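As a rough sketch of that selection step (a reconstruction under assumptions, not the notebook's original cell), the hotels can be ranked by review_nr and the top hotel_id values written out as JSON. The function names and the sample frame are hypothetical. The column names and the file name come from the text above.

```python
import json

import pandas as pd


def top_reviewed_hotel_ids(df, n=400):
    """Return the hotel_id values of the n hotels with the most reviews."""
    return df.nlargest(n, "review_nr")["hotel_id"].tolist()


def save_hotel_ids(hotel_ids, path="data/hotel_ids_for_reviews.json"):
    """Write the id list to disk as a JSON array."""
    with open(path, "w") as fh:
        json.dump(hotel_ids, fh)


# Hypothetical sample; the report applies this to the cleaned dataset df3 with n=400.
sample = pd.DataFrame(
    {"hotel_id": [11, 22, 33, 44], "review_nr": [120, 8, 950, 40]}
)
hotel_ids = top_reviewed_hotel_ids(sample, n=2)
payload = json.dumps(hotel_ids)
print(payload)
```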

Review Data Cleaning

In this subsection, we clean the review data. We keep only the useful columns (e.g. pros, cons), and then define some functions for text processing.

Firstly, we remove redundant columns that are not useful for our text data analysis.

Then, we need to pre-process the texts, which mainly includes

Positive Review Analysis

In this subsection, we focus on positive reviews.

Word Cloud

Based on the word cloud and the nature of the data, we could extend our stopwords list to further remove unnecessary words for further analysis. In our positive review data, words like 'great', 'good' and 'nice' are not useful for topic modelling, and may affect the accuracy of the model, so we decided to treat them as stopwords to remove them.
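Extending the stopword list could look like the sketch below. The base list here is a tiny stand-in (the notebook presumably started from a library list such as NLTK's), and the helper name is hypothetical. Only the three added words come from the text.

```python
# Stand-in for a library stopword list (assumption for the example).
BASE_STOPWORDS = {"the", "a", "and", "was", "is", "very", "to", "of"}

# Generic praise words that carry no topic information in positive reviews.
EXTRA_STOPWORDS = {"great", "good", "nice"}

STOPWORDS = BASE_STOPWORDS | EXTRA_STOPWORDS


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the extended stopword set."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


tokens = "The location was great and the staff very nice".split()
filtered = remove_stopwords(tokens)
print(filtered)
```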

Topic Modelling - LDA

Have a look at the dictionary (Only showing the first 10) we generated:

Here is a Bag-of-words example. We randomly picked a review for demonstration.
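To make the idea concrete, the sketch below builds a token dictionary and a bag-of-words vector with the standard library only. It mirrors the (token_id, count) pair format but is not the notebook's code: the documents are made up, and ids are assigned in order of first appearance, which may differ from the ids a library such as gensim would assign.

```python
from collections import Counter

# Hypothetical pre-processed reviews (already tokenised and cleaned).
docs = [
    ["staff", "friendly", "location"],
    ["location", "clean", "room", "location"],
]

# Dictionary: map each distinct token to an integer id.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))


def doc2bow(doc):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


print(token2id)
print(doc2bow(docs[1]))
```

Word order is discarded in this representation. Only how often each dictionary token occurs in the review is kept, which is the input an LDA model works from.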

Examining the top 10 words in each topic for positive reviews, we can conclude from the bar plots that:

t-SNE

The plot above shows the LDA result visualised with t-SNE, a popular dimensionality reduction technique for visualising high-dimensional data. By hovering over the data points, the original positive reviews can be seen. Different colors represent different topics extracted from the LDA model.

As we can see from the t-SNE plot, the data points are not separated very well, since the underlying patterns in the data are complex and subtle.

Negative Review Analysis

Next, we follow a similar procedure to analyse negative reviews.

Word Cloud

From the word cloud, we can see complaints about small rooms and breakfast. Next, we try to extract topics related to these issues.

Topic Modelling - LDA

Examining the top 10 words in each topic for negative reviews, we can conclude from the bar plots that:

As expected, these extracted topics are common issues that guests frequently complain about. To improve guest satisfaction, managers should address these issues in their hotels.

t-SNE

Similarly, the plot above shows the LDA result for negative reviews using t-SNE. The topics appear to overlap less than they do for positive reviews.

This subsection has several limitations. First, the number of reviews we obtained is not large, due to the limited number of API calls available and the computational cost of lemmatization. Second, each comment often covers multiple elements or topics, which prevents the data points in the t-SNE plot from separating well. In addition, some reviews were too short, and some pros were written in the cons field or vice versa. These factors limit the accuracy of our analysis. In future work, we could increase the number of reviews and improve the data quality.

4.7 Machine Learning Models for Hotel Price Prediction

The last section of this report focuses on using machine learning models to predict hotel room prices in London. Models used include:

Ultimately, the section aims to provide valuable insights into the effectiveness of these models in predicting hotel prices in London, and the impact of various features on the hotel price.

Data Preparation for Modelling Purpose

Model 1: Linear Regression

Linear regression is a popular machine learning algorithm that is used to predict numerical variables based on a set of input features. Here it can provide valuable insights into the key drivers of hotel prices.

Model 2: Ridge Regression

Ridge regression is a regularized version of linear regression that adds a penalty on the squared size of the coefficients (an L2 penalty) to the cost function. This helps to prevent over-fitting and can improve the out-of-sample accuracy of the linear regression model.

Model 3: Lasso Regression

Lasso regression is another type of regularized linear regression. It penalizes the absolute size of the coefficients (an L1 penalty), which shrinks them towards zero and can set some of them exactly to zero.

Model 4: Polynomial Regression

Polynomial regression is a variation of linear regression that allows for nonlinear relationships between the predictors and the outcome.

For instance, when forecasting how the price of a hotel changes as the distance from the city centre (distance_to_cc) increases, prices may change more rapidly for hotels closer to the city centre, or the relationship may involve a quadratic or cubic term that Linear Regression is unable to capture.
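The point can be illustrated on synthetic data where price falls with distance along a curve. The numbers below are invented for the demonstration and say nothing about the real dataset. The sketch uses numpy.polyfit rather than the notebook's modelling code.

```python
import numpy as np

# Synthetic relationship: price drops quickly near the centre, then flattens.
distance = np.linspace(0.0, 8.0, 17)
price = 300.0 - 80.0 * distance + 6.0 * distance**2


def fit_rmse(degree):
    """Fit a polynomial of the given degree and return its in-sample RMSE."""
    coeffs = np.polyfit(distance, price, deg=degree)
    predicted = np.polyval(coeffs, distance)
    return float(np.sqrt(np.mean((price - predicted) ** 2)))


linear_rmse = fit_rmse(1)      # a straight line cannot follow the curvature
quadratic_rmse = fit_rmse(2)   # the squared term captures it
print(linear_rmse, quadratic_rmse)
```

On real data the gain is less clear-cut, and higher degrees can over-fit, so the degree is best chosen on held-out data.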

Model 5: SVM (Support Vector Machine) Regression

Like Polynomial Regression, Support Vector Machine (SVM) regression is useful in cases where there is a nonlinear relationship between the predictors and the outcome.

Model 6: XGBoost Regression

XGBoost, which is short for Extreme Gradient Boosting, is a powerful boosting algorithm. It works by building an ensemble of decision trees sequentially, with each new tree trained to correct the errors of the trees before it, which gives the ensemble a relatively low bias.

Model 7: Random Forest Regression

Random Forest is another ensemble method that works by building a large number of decision trees and combining their predictions to produce a final outcome.

Evaluation Metrics

The bar charts shown above represent the evaluation metrics of each model. We could see that both Random Forest and XGBoost had superior performance compared to the others. XGBoost performed slightly better on MAE, while Random Forest had much lower RMSE and higher Accuracy. Therefore, we will suggest the Random Forest Regressor model for further analysis, although XGBoost is also a good choice.
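For reference, MAE and RMSE can be computed as in the sketch below, using made-up prices. RMSE squares the errors before averaging, so a few large misses (for example on luxury outliers) raise it much more than they raise MAE. This is one reason two models can rank differently on the two metrics.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical prices: three small errors and one large miss.
y_true = [100.0, 150.0, 200.0, 900.0]
y_pred = [110.0, 140.0, 210.0, 500.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred))
```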

Feature Importance

The feature importance of the different variables in the models is then analysed.

We can see from the horizontal bar chart above that review_score, class, num_of_bed, distance_to_cc and mobile_discount_percentage have the biggest impact on our model.

This provides insights into which variables are the most important for predicting hotel prices, and can help guide future feature engineering or data collection efforts. For example, knowing that mobile_discount_percentage is a significant factor in predicting hotel prices, hotels could consider adjusting their pricing strategies for mobile users.

Conclusion

To summarise, in order to gain insights into numerous facets of the hotel sector in this wonderful city, we examined a comprehensive dataset of hotels in London. We first cleaned and prepared the data, and then conducted an in-depth analysis that produced a number of intriguing results concerning the hotels in London.

We found valuable insights into the London hotel market, for example:

There are also several limitations and further actions to consider, for example:

In conclusion, this report provides valuable insights into the hotel industry in London, highlighting important trends and factors that affect hotel pricing and customer satisfaction. Our findings and the machine learning model can be used by hotel owners, industry professionals and travellers to make more informed decisions.

References

GeeksforGeeks. (2022). Get the City, State, and Country names from Latitude and Longitude using Python. [online] Available at: https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/ [Accessed 4 Jan. 2023].

GeeksforGeeks. (2023). Removing stop words with NLTK in Python. [online] Available at: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/ [Accessed 19 Feb. 2023].

Gensim Tutorial. (n.d.). [online] Available at: https://tedboy.github.io/nlps/gensim_tutorial/tutorial.html [Accessed 19 Feb. 2023].

House of Commons Library. (2022). Hospitality industry in the UK: pre-pandemic statistics. [online] Available at: https://commonslibrary.parliament.uk/research-briefings/cbp-9111/ [Accessed 24 Feb. 2023].

Kapadia, S. (2019). Topic Modeling in Python: Latent Dirichlet Allocation (LDA). [online] Towards Data Science. Available at: https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0 [Accessed 18 Feb. 2023].

Plotly. (n.d.). Getting started with Plotly in Python. [online] Plotly. Available at: https://plotly.com/python/getting-started/ [Accessed 18 Feb. 2023].

spaCy. (n.d.). spaCy 101: Everything you need to know. [online] Available at: https://spacy.io/usage/spacy-101 [Accessed 19 Feb. 2023].